{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "70e9b619",
   "metadata": {},
   "source": [
    "# MarkdownHeaderTextSplitter\n",
    "\n",
    "### Motivation\n",
    "\n",
    "Many chat or Q+A applications involve chunking input documents prior to embedding and vector storage.\n",
    "\n",
    "[These notes](https://www.pinecone.io/learn/chunking-strategies/) from Pinecone provide some useful tips:\n",
    "\n",
    "```\n",
    "When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text.\n",
    "```\n",
    " \n",
    "As mentioned, chunking often aims to keep text with common context together.\n",
    "\n",
    "With this in mind, we might want to specifically honor the structure of the document itself.\n",
    "\n",
    "For example, a markdown file is organized by headers.\n",
    "\n",
    "Creating chunks within specific header groups is an intuitive idea.\n",
    "\n",
    "To address this challenge, we can use `MarkdownHeaderTextSplitter`.\n",
    "\n",
    "This will split a markdown file by a specified set of headers. \n",
    "\n",
    "For example, if we want to split this markdown:\n",
    "```\n",
    "md = '# Foo\\n\\n ## Bar\\n\\nHi this is Jim  \\nHi this is Joe\\n\\n ## Baz\\n\\n Hi this is Molly' \n",
    "```\n",
    " \n",
    "We can specify the headers to split on:\n",
    "```\n",
    "[(\"#\", \"Header 1\"),(\"##\", \"Header 2\")]\n",
    "```\n",
    "\n",
    "And content is grouped or split by common headers:\n",
    "```\n",
    "{'content': 'Hi this is Jim  \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
    "{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n",
    "```\n",
    "\n",
    "Let's have a look at some examples below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "19c044f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import MarkdownHeaderTextSplitter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2ae3649b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'content': 'Hi this is Jim  \\nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}\n",
      "{'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}}\n",
      "{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}\n"
     ]
    }
   ],
   "source": [
    "markdown_document = \"# Foo\\n\\n    ## Bar\\n\\nHi this is Jim\\n\\nHi this is Joe\\n\\n ### Boo \\n\\n Hi this is Lance \\n\\n ## Baz\\n\\n Hi this is Molly\"\n",
    "\n",
    "headers_to_split_on = [\n",
    "    (\"#\", \"Header 1\"),\n",
    "    (\"##\", \"Header 2\"),\n",
    "    (\"###\", \"Header 3\"),\n",
    "]\n",
    "\n",
    "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
    "md_header_splits = markdown_splitter.split_text(markdown_document)\n",
    "for split in md_header_splits:\n",
    "    print(split)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9bd8977a",
   "metadata": {},
   "source": [
    "Within each markdown group we can then apply any text splitter we want. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "480e0e3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "markdown_document = \"# Intro \\n\\n    ## History \\n\\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \\n\\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \\n\\n ## Rise and divergence \\n\\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \\n\\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \\n\\n #### Standardization \\n\\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \\n\\n ## Implementations \\n\\n Implementations of Markdown are available for over a dozen programming languages.\"\n",
    "\n",
    "headers_to_split_on = [\n",
    "    (\"#\", \"Header 1\"),\n",
    "    (\"##\", \"Header 2\"),\n",
    "]\n",
    "\n",
    "# MD splits\n",
    "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
    "md_header_splits = markdown_splitter.split_text(markdown_document)\n",
    "\n",
    "# Char-level splits\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "chunk_size = 10\n",
    "chunk_overlap = 0\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)\n",
    "\n",
    "# Split within each header group\n",
    "all_splits=[]\n",
    "all_metadatas=[]    \n",
    "for header_group in md_header_splits:\n",
    "    _splits = text_splitter.split_text(header_group['content'])\n",
    "    _metadatas = [header_group['metadata'] for _ in _splits]\n",
    "    all_splits += _splits\n",
    "    all_metadatas += _metadatas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "3f5d775e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Markdown[9'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_splits[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "33ab0d5c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Header 1': 'Intro', 'Header 2': 'History'}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_metadatas[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dcf70760",
   "metadata": {},
   "source": [
    "### Use case\n",
    "\n",
    "Let's appy `MarkdownHeaderTextSplitter` to a Notion page [here](https://rlancemartin.notion.site/Auto-Evaluation-of-Metadata-Filtering-18502448c85240828f33716740f9574b?pvs=4) as a test.\n",
    "\n",
    "The page is downloaded as markdown and stored locally as shown [here](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/notion)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "73313d6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load Notion database as a markdownfile file\n",
    "from langchain.document_loaders import NotionDirectoryLoader\n",
    "loader = NotionDirectoryLoader(\"../Notion_DB_Metadata\")\n",
    "docs = loader.load()\n",
    "md_file=docs[0].page_content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "6fa341d7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'content': 'We previously introduced [auto-evaluator](https://blog.langchain.dev/auto-evaluator-opportunities/), an open-source tool for grading LLM question-answer chains. Here, we extend auto-evaluator with a [lightweight Streamlit app](https://github.com/langchain-ai/auto-evaluator/tree/main/streamlit) that can connect to any existing Pinecone index. We add the ability to test metadata filtering using `SelfQueryRetriever` as well as some other approaches that we’ve found to be useful, as discussed below.  \\n[ret_trim.mov](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/ret_trim.mov)',\n",
       " 'metadata': {'Section': 'Evaluation'}}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Let's create groups based on the section headers\n",
    "headers_to_split_on = [\n",
    "    (\"###\", \"Section\"),\n",
    "]\n",
    "markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)\n",
    "md_header_splits = markdown_splitter.split_text(md_file)\n",
    "md_header_splits[3]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42d8bb9b",
   "metadata": {},
   "source": [
    "Now, we split the text in each group and keep the group as metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "a9831de2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define our text splitter\n",
    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
    "chunk_size = 500\n",
    "chunk_overlap = 50\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)\n",
    " \n",
    "# Create splits within each header group\n",
    "all_splits=[]\n",
    "all_metadatas=[]\n",
    "for header_group in md_header_splits:\n",
    "    _splits = text_splitter.split_text(header_group['content'])\n",
    "    _metadatas = [header_group['metadata'] for _ in _splits]\n",
    "    all_splits += _splits\n",
    "    all_metadatas += _metadatas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "b5691ee5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'In these cases, semantic search will look for the concept `episode 53` in the chunks, but instead we simply want to filter the chunks for `episode 53` and then perform semantic search to extract those that best summarize the episode. Metadata filtering does this, so long as we 1) we have a metadata filter for episode number and 2) we can extract the value from the query (e.g., `54` or `252`) that we want to extract. The LangChain `SelfQueryRetriever` does the latter (see'"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_splits[6]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "e1dfb405",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Section': 'Motivation'}"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_metadatas[6]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79868606",
   "metadata": {},
   "source": [
    "This sets us up well do perform metadata filtering based on the document structure.\n",
    "\n",
    "Let's bring this all togther by building a vectorstore first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "143d7347",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install chromadb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "cbcb917a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build vectorstore\n",
    "from langchain.vectorstores import Chroma\n",
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "embeddings = OpenAIEmbeddings()\n",
    "vectorstore = Chroma.from_texts(texts=all_splits,metadatas=all_metadatas,embedding=OpenAIEmbeddings())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f6031fc",
   "metadata": {},
   "source": [
    "Let's create a `SelfQueryRetriever` that can filter based upon metadata we defined."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "5b1b6a75",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create retriever \n",
    "from langchain.llms import OpenAI\n",
    "from langchain.retrievers.self_query.base import SelfQueryRetriever\n",
    "from langchain.chains.query_constructor.base import AttributeInfo\n",
    "\n",
    "# Define our metadata\n",
    "metadata_field_info = [\n",
    "    AttributeInfo(\n",
    "        name=\"Section\",\n",
    "        description=\"Headers of the markdown document that organize the ideas\",\n",
    "        type=\"string or list[string]\",\n",
    "    ),\n",
    "]\n",
    "document_content_description = \"Headers of the markdown document\"\n",
    "\n",
    "# Define self query retriver\n",
    "llm = OpenAI(temperature=0)\n",
    "sq_retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info, verbose=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9d0dbed8",
   "metadata": {},
   "source": [
    "Now we can fetch chunks specifically from any section of the doc!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "6c37fe1b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[Document(page_content='![Untitled](Auto-Evaluation%20of%20Metadata%20Filtering%2018502448c85240828f33716740f9574b/Untitled.png)', metadata={'Section': 'Introduction'}),\n",
       " Document(page_content='Q+A systems often use a two-step approach: retrieve relevant text chunks and then synthesize them into an answer. There many ways to approach this. For example, we recently [discussed](https://blog.langchain.dev/auto-evaluation-of-anthropic-100k-context-window/) the Retriever-Less option (at bottom in the below diagram), highlighting the Anthropic 100k context window model. Metadata filtering is an alternative approach that pre-filters chunks based on a user-defined criteria in a VectorDB using', metadata={'Section': 'Introduction'}),\n",
       " Document(page_content='on a user-defined criteria in a VectorDB using metadata tags prior to semantic search.', metadata={'Section': 'Introduction'})]"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Test\n",
    "question=\"Summarize the Introduction section of the document\"\n",
    "sq_retriever.get_relevant_documents(question)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb0efebd",
   "metadata": {},
   "source": [
    "Now, we can create chat or Q+A apps that are aware of the explict document structure. \n",
    "\n",
    "Of course, semantic search without specific metadata filtering would probably work reasonably well for this simple document.\n",
    "\n",
    "But, the ability to retain document structure for metadata filtering can be helpful for more complicated or longer documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "3b40e24e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "query='Introduction' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Introduction') limit=None\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'The document discusses different approaches to retrieve relevant text chunks and synthesize them into an answer in Q+A systems. One of the approaches is metadata filtering, which pre-filters chunks based on user-defined criteria in a VectorDB using metadata tags prior to semantic search. The Retriever-Less option, which uses the Anthropic 100k context window model, is also mentioned as an alternative approach.'"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain.chains import RetrievalQA\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "llm = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0)\n",
    "qa_chain = RetrievalQA.from_chain_type(llm,retriever=sq_retriever)\n",
    "qa_chain.run(question)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "dfeeb327",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "query='Testing' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='Section', value='Testing') limit=None\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'The Testing section of the document describes how the performance of the SelfQueryRetriever was evaluated using various test cases. The tests were designed to evaluate the ability of the SelfQueryRetriever to correctly infer metadata filters from the query using metadata_field_info. The results of the tests showed that the SelfQueryRetriever performed well in some cases, but failed in others. The document also provides a link to the code for the auto-evaluator and instructions on how to use it. Additionally, the document mentions the use of the Kor library for structured data extraction to explicitly specify transformations that the auto-evaluator can use.'"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "question=\"Summarize the Testing section of the document\"\n",
    "qa_chain.run(question)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
